Abstract. Recommendations based on behavioral data may be faced 
with ambiguous statistical evidence. We consider the case of association 
rules, relevant e.g. for query and product recommendations. For example: 
Suppose that a customer belongs to categories A and B, each of which 
is known to have positive correlation with buying product C, how do we 
estimate the probability that she will buy product C? 

For rare terms or products there may not be enough data to directly 
produce such an estimate; perhaps we never directly observed a connection 
between A, B, and C. What can we do when there is no support 
for estimating the probability by simply computing the observed frequency? 
In particular, what is the right thing to do when A and B give 
rise to very different estimates of the probability of C? 

We consider the use of maximum entropy probability estimates, which 
give a principled way of extrapolating probabilities of events that do not 
even occur in the data set! Focusing on the basic case of three variables, 
our main technical contributions are that (under mild assumptions): 1) 
There exists a simple, explicit formula that gives a good approximation of 
maximum entropy estimates, and 2) Maximum entropy estimates based 
on a small number of samples are provably tightly concentrated around 
the true maximum entropy frequency that arises if we let the number of 
samples go to infinity. 

Our empirical work demonstrates the surprising precision of maximum 
entropy estimates across a range of real-life transaction data sets. In 
particular, we observe that the average absolute error of maximum entropy 
estimates is a factor of 3 to 14 smaller than that of independence or 
extrapolation estimates when the data used to make the estimates has low 
support. We believe that the same principle can be used to synthesize 
probability estimates in many settings. 
1 Introduction 


Recommender systems that try to assess probabilities, e.g. for estimating 
probabilities based on the context of a particular user, may be faced with ambiguous 
statistical evidence. For example, consider the task: A customer is known to 
belong to categories A and B, each of which is known to increase the probability of 
buying product C by 50%. How do we estimate the probability that she will buy 
product C? Is it increased by 50%, 100%, or perhaps 125%? 

Of course we may have enough data on $S_1$, $S_2$, and $S_3$ to make this assessment 
by computing the observed probability. But for rare terms or products there 
may not be enough data to directly produce such an estimate; perhaps we 
never directly observed a connection between $S_1$, $S_2$, and $S_3$. In the extreme 
case, what can we do when there is no support for estimating the probability 
by simply computing the observed frequency? Most likely, even the number of 
observations of proper subsets of $S_1$, $S_2$, and $S_3$ will then be small enough that 
there is nonnegligible uncertainty about the pairwise correlations. 

The difficulty of estimating probabilities of events occurring clearly depends 
on the distribution of the input, and on how much information we have about this 
distribution. So rather than a classical approach that considers worst-case data, 
we should consider ideas from statistical analysis. The Maximum Entropy (maxent) 
Model is a method of statistical inference that, based on partial knowledge 
of a distribution, provides a maximum entropy estimate. Informally, it provides a 
probability prediction based on the distribution that has “the least bias possible” 
based on the given observations. In this paper we consider the use of maximum 
entropy estimates in information retrieval contexts where estimates are sought 
of the probability that an item or term is of interest to a user. 

1.1 Motivating examples 

Movie recommendation. Suppose you know that a user loved “The Rock” 
and “The Matrix”. What is the probability that he already saw and gave 5 stars 
to “One Flew Over the Cuckoo’s Nest”? We will try to answer this question 
using the smallest possible sample of the MovieLens data set, which is examined 
further in Section 3. The difficulty is that only about 1 in 1000 people has seen 
all three movies and given them 5 stars. This means that to get a statistically 
significant answer (in absence of other information) we need to ask a very large 
set of people. Obviously, the less mainstream the films you consider, the bigger this 
problem will become. See Figure 1 for a visualization of the setting. 

However, it is considerably easier to obtain information on pairs of movies. 
About 2.5% of people will have seen at least two of these movies and given 
them 5 stars. This means that we can reliably estimate conditional probabilities 
based on significantly less data. Using the information that your user loved 
“The Rock” gives a probability of about 11% that he loves “One Flew Over the 
Cuckoo’s Nest”. However, if we instead use the information that he loved “The 
Matrix” we get an estimate of about 7%. It seems that both these movies make 
it more likely that he will love “One Flew Over the Cuckoo’s Nest”, but how 
do we combine these pieces of information? It seems that anywhere in the range 
11-18% is a reasonable guess. 

To resolve this ambiguity we again use a maximum entropy estimate based 
on subset frequencies. This estimate takes the correlation of “The Rock” and 
“The Matrix” into account, and arrives at an estimate of 14% based on a data 
set of 673 people in which nobody has given all three movies 5 stars. When we 
consider a data set 100 times larger it is possible to see how well this estimate 
fares: In the larger data set, 15% of those who gave 5 stars to “The Rock” and 
“The Matrix” also gave 5 stars to “One Flew Over the Cuckoo’s Nest”. We 
generally find that maximum entropy estimates are surprisingly accurate across 
a wide range of data sets from different areas. 
Fig. 1: Two distributions of movie watchers who love three selected movies. 
Figure 1a shows the observed probabilities in a sample of 1% of the data set, from 
which we want to approximate $S_1 \cap S_2 \cap S_3$. This is consistent with $S_1$, $S_2$, and 
$S_3$ never occurring together. Figure 1b shows the probabilities in the whole data 
set, and indeed it is not the case that loving two of the movies precludes loving 
the third one. Our findings are that a maximum entropy estimate in such a case 
is well-concentrated, as opposed to independence or extrapolation estimators. In 
this example an independence assumption yields an estimate of 36 occurrences, 
our maxent estimate yields 82, while the true number of occurrences is 90. 



Query completion. Consider the case where a search engine user types the 
words “jordan air” (followed by a space). What words should be suggested to 
complete the query? 

We consider the simple method of ignoring the order of words and relying 
on association rules, in our case obtained from a set of 2.1 million queries 
from a major US search engine. Suppose we have two competing suggestions, 
“force” and “wholesale” (occurring in around 0.03% and 0.06% of queries, 
respectively). Around 9% of the queries that contain the word “air” also 
contain the word “force”. On the other hand, less than 0.00003% of past queries 
contain “jordan” and “force”, which means that the maximum entropy 
estimate for the probability of completing with “force” becomes less than 1%. For 
comparison, both “air” and “jordan” significantly increase the probability that 
the word “wholesale” occurs, to around 0.3% and 1%, respectively. Figure 2 
summarizes the association rules involving two words. 

If the combination of words had been just slightly more rare, we might have 
had no past queries containing them. Thus, for “long tail” queries we need to 
rely on other methods for estimating the likelihood of a particular completion. 

A maximum entropy estimate, or more precisely the approximation formula 
of Equation (1), shows that the user has a 5% probability of completing with “wholesale”. 
This is consistent with the data, which contains 31 queries including {jordan, 
air, wholesale} out of a total of 575 queries containing “jordan” and “air”. 


Assoc. rule            Confidence
air => force           0.091
jordan => wholesale    0.010
air => wholesale       0.0030
jordan => force        0.00097


Fig. 2: Association rules from the words jordan and air to the words force and 
wholesale in the query set. The rules are sorted by decreasing confidence. 
In the whole data set, wholesale occurs in a fraction 0.00064 of queries, and 
force in a fraction 0.00028. By the first association rule, presence of the word 
air increases the probability of the word force hundreds of times, to over 9%. 
However, a maximum entropy estimate correctly predicts less than 1% 
probability of seeing force when both air and jordan are present. On the other 
hand, the presence of air and jordan yields a maximum entropy probability for 
wholesale of 3%, close to the observed frequency of 5%. 


We will argue by experimental evidence that in scenarios such as the one 
above, the maximum entropy estimate will give a better prediction than both 
extrapolation and the independence model (see Section 2.2 for definitions), while 
still being efficiently computable. 






1.2 Our results 


Problem definition. We consider the problem of estimating probabilities of 
conjunctions of boolean random variables, where each such conjunction occurs a 
statistically insignificant number of times in a data set of samples from the joint 
distribution (e.g., given by a complete data set). For some big data set $\mathcal{D}$ we 
consider a sample $D \subset \mathcal{D}$. Given such $D$ we wish to estimate event frequencies 
of $\mathcal{D}$ also in the difficult cases where the events do not occur in $D$. In particular 
we will focus on triples: Let $I$ be the set of possible items, $|I| = n$, and $D$ be 
an $m \times n$ binary data set where each of the $m$ rows $D_i$ encodes a transaction 
$D_i \subseteq I$. For all singleton and pair subsets of $I$ we assume that we know the 
number of transactions in which they occur, i.e., all singleton and pair frequencies 
are known. For each $X \subseteq I$, $|X| = 3$, where the frequency $\theta_X$ in $D$ is 0 we then 
wish to estimate $\theta_X$ in $\mathcal{D}$. 

Our contribution. We consider triple frequency estimation based on the 
principle of maximum entropy. Our main theoretical result is that a maximum 
entropy estimate based on a sample, which implies that the frequencies used as 
input to the estimator will have some relative error $\varepsilon$, will yield an estimate close 
to the true triple frequency under the maximum entropy assumption. We show 
this through a surprisingly simple estimator $\hat{p}$ that approximates the maximum 
entropy estimate well when triple frequencies are small. 

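Theorem 1. Consider boolean random variables $X$, $Y$, $W$ where $\Pr(X)$, $\Pr(Y)$, 
$\Pr(W)$, $\Pr(XY)$, $\Pr(XW)$ and $\Pr(YW)$ are given. Assume that the maximum 
entropy distribution consistent with these probabilities satisfies 

\[ \Pr(XYW) \le \varepsilon \min\big( \Pr(\overline{X}\,\overline{Y}\,\overline{W}),\ \Pr(X\overline{Y}\,\overline{W}),\ \Pr(\overline{X}Y\overline{W}),\ \Pr(\overline{X}\,\overline{Y}W),\ \Pr(XY\overline{W}),\ \Pr(X\overline{Y}W),\ \Pr(\overline{X}YW) \big). \]

Then given probability estimates $p_{\cdot}$ such that 

\[ \frac{p_{\overline{X}\,\overline{Y}\,\overline{W}}}{\Pr(\overline{X}\,\overline{Y}\,\overline{W})},\ \frac{p_{X\overline{Y}\,\overline{W}}}{\Pr(X\overline{Y}\,\overline{W})},\ \frac{p_{\overline{X}Y\overline{W}}}{\Pr(\overline{X}Y\overline{W})},\ \frac{p_{\overline{X}\,\overline{Y}W}}{\Pr(\overline{X}\,\overline{Y}W)},\ \frac{p_{XY\overline{W}}}{\Pr(XY\overline{W})},\ \frac{p_{X\overline{Y}W}}{\Pr(X\overline{Y}W)},\ \frac{p_{\overline{X}YW}}{\Pr(\overline{X}YW)} \in [(1 - \varepsilon); (1 + \varepsilon)], \]

it holds that 

\[ \hat{p} = \frac{p_{XY\overline{W}}\; p_{X\overline{Y}W}\; p_{\overline{X}YW}\; p_{\overline{X}\,\overline{Y}\,\overline{W}}}{p_{X\overline{Y}\,\overline{W}}\; p_{\overline{X}Y\overline{W}}\; p_{\overline{X}\,\overline{Y}W}} \in [(1 - O(\varepsilon)) \Pr(XYW),\ (1 + O(\varepsilon)) \Pr(XYW)]. \]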

It follows from Theorem 1 that a) using sampled data to perform maximum 
entropy estimates of probabilities in the bigger data is theoretically well-founded, and 
b) there is a simple explicit estimator, $\hat{p}$, that approximates the maximum 
entropy estimate in the interesting case where the triple frequency is significantly 
smaller than the pair frequencies. 

It is instructive to consider a less precise, but even simpler estimator for 
the case where, informally, there is no strong positive correlation among $X$, 
$Y$, and $W$, and $\Pr(\overline{X}\,\overline{Y}\,\overline{W})$ is close to 1. Then $p_{XY\overline{W}}/p_{X\overline{Y}\,\overline{W}} \approx p_{Y|X}$, the 
observed probability of $Y$ given $X$, and similarly $p_{X\overline{Y}W}/p_{\overline{X}\,\overline{Y}W} \approx p_{X|W}$ and 
$p_{\overline{X}YW}/p_{\overline{X}Y\overline{W}} \approx p_{W|Y}$, so we can approximate the triple frequency by: 

\[ p^* = p_{Y|X}\, p_{X|W}\, p_{W|Y} \tag{1} \]

Applying (1) to estimate $\Pr(W \mid XY)$, we get the estimator 

\[ p^\# = p_{Y|X}\, p_{X|W}\, p_{W|Y} / p_{XY} = p_{W|X}\, p_{W|Y} / p_W, \]

that is, the factors by which conditioning on $X$ and $Y$ influence the probability 
of $W$ get multiplied. 
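
As an illustration of the two simple estimators above, the following minimal Python sketch evaluates $p^*$ and $p^\#$. The function and parameter names are hypothetical and chosen only for this example; the numeric inputs are the confidences and the base frequency of “wholesale” quoted in the query completion example.

```python
def p_star(p_y_given_x, p_x_given_w, p_w_given_y):
    """Approximate triple frequency, Equation (1): p* = p(Y|X) p(X|W) p(W|Y)."""
    return p_y_given_x * p_x_given_w * p_w_given_y


def p_sharp(p_w_given_x, p_w_given_y, p_w):
    """Approximate Pr(W | XY): the two lift factors p(W|X)/p(W) and
    p(W|Y)/p(W) are multiplied onto the base probability p(W)."""
    return p_w_given_x * p_w_given_y / p_w


# Query completion example: X = jordan, Y = air, W = wholesale.
wholesale_given_jordan_air = p_sharp(0.010, 0.0030, 0.00064)
print(f"estimated Pr(wholesale | jordan, air) = {wholesale_given_jordan_air:.4f}")

# Introductory example: two categories that each raise the base rate by 50%.
base = 0.02  # hypothetical base probability of buying product C
combined_lift = p_sharp(1.5 * base, 1.5 * base, base) / base
print(f"combined lift = {combined_lift:.2f}")
```

Under this estimator, two factors of 1.5 combine to a factor of 2.25, i.e. the 125% increase mentioned as one of the candidate answers in the introduction.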

Empirical study. Our experimental evaluation on real data sets shows that 
maximum entropy estimates give meaningful, and often quite precise, frequency 
predictions also in cases where the independence estimate $\theta^i_X$ and the 
extrapolation estimate $\theta^e_X$ do not. The error of the estimator is well modeled by the 
assumption that transactions are independently sampled from a distribution 
having the estimated subset probabilities. 

Overview. In Section 2.1 we introduce basic terms and notation followed by 
a description in Section 2.3 of how the maximum entropy estimate of an item set 
is computed. We then prove in Equation (14) that the maxent estimate is not 
too sensitive to error on the input distribution. For the experimental evaluation 
we first show in Section 3.1 that the maximum entropy estimate achieves better 
concentration in general and then in Section 3.2 we discuss results on item sets 
of low, statistically insignificant support. 


1.3 Related work 

The principle of maximum entropy dates far back, but was introduced to 
information theory in a seminal work by E. T. Jaynes [F]. It has since seen applications 
in a large number of areas. 

The maximum entropy distribution of $n$ random variables is known to be 
computable in time exponential in $n$ using the well-known Iterative Scaling 
algorithm [2]. The running time is due to the fact that for $n$ variables, there are 
$2^n$ subsets of variables. In the general case, that is with no knowledge of the 
distribution, Tatti has proved that querying the model is PP-hard uni, which is 
(believed to be) harder than NP. 

Association rule mining 03 is a well-known and extensively studied 
problem, where a rule has the form $X \Rightarrow Y$ with $X$, $Y$ being disjoint subsets of 
random variables. In transactional data sets association rule mining 
traditionally relies on finding frequent itemsets, i.e., for some set of items $I$ and a set of 
transactions $D$ over $I$ one wishes to report the sets $X \subseteq I$ that are 
contained in more than $s$ transactions, for a fixed threshold $s$. 

The maxent distribution has been used as a model to measure how significant 
an itemset is, in the framework of frequent itemset mining, e.g. cm- The general 
approach is to compute the maximum entropy distribution (via the Iterative 
Scaling algorithm) and then compute the Kullback-Leibler divergence with the 
empirical distribution, from which a p-value can be found that is used to rank 
the item sets. Our approach is that we observe some sample of the subsets of 
the set of interest and then use these subsets to efficiently compute the maxent 
estimate. 

For the case where all frequencies of the strict subsets are known, the 
maximum entropy model has been used by R. Meo [9] by comparing the probability 
estimate under maximum entropy to the empirical probability in order to achieve 
a measure of ranking an itemset. One of the main open problems of [9] was 
determining the existence of a closed form formula for a maxent estimate given the 
subset probabilities. This was partially resolved in [5], where the author provides 
a formula for the 1-dimensional search space in which the maxent estimate lies, 
which is then traversed by binary search that is shown to converge to the maxent 
estimate. 

Comparisons. Our estimator takes as parameters all the single and pair 
frequencies. The hypothesis present implicitly in our model is that data generally 
has weak third-order dependencies. This is also the reason Chow-Liu trees [3], 
which model only first and second-order dependencies, are known to be good 
approximations of many observed distributions. The maximum entropy estimate 
used can be seen as an application of [9], where singleton and pair frequencies 
are used to efficiently compute a maxent estimate in the interesting case where 
the estimand has no observed occurrences. One of the main open problems of [9] 
is an explicit formula for the maxent estimate. Theorem 1 in this paper shows 
an explicit formula for an approximation of the maxent estimate under certain 
conditions. As the 1D search space of the maximum entropy estimate $\theta^m_X$ is determined 
when having all subset frequencies of $X$, we can compute a good maximum 
entropy estimate using a small constant number of iterations of binary search as 
opposed to computing the full maximum entropy distribution. We give a proof of 
this that is similar to that of Meo [8], with the distinction that where she shows 
that there exist constants such that the maxent estimate can be classified by 
setting particular equations to be equal to each other, in our proof the constants 
are explicitly stated. 

2 Frequency estimates of itemsets 

2.1 Preliminaries 

We provide some definitions and notation that will be used throughout the paper. 

Recall that a boolean random variable $A$ is a variable with values in 
$\{0,1\}$. A binary data set $D$ of observations of boolean random variables is an 
$m \times n$ matrix consisting of $m$ binary $n$-sized vectors, where index $0 \le i \le n - 1$ 
corresponds to the outcome of boolean random variable $A_i$. For a particular 
subset of the boolean variables $S \subseteq [n]$, a binary vector $\omega$ covers $S$ iff 
$\omega_i = 1$ for every $i \in S$. The frequency $\theta_S$ of a set of boolean variables is the 
proportion of the $m$ row vectors in $D$ that cover $S$. 


A distribution $p$ over data $D$ is a mapping 

\[ p : \{0,1\}^n \to [0,1] \quad \text{s.t.} \quad \sum_{\omega \in \{0,1\}^n} p(\omega) = 1. \]

For a distribution $p$ and a vector of 1s $v$ we denote by $p(S = v) = p(S = 1) = \theta_S$ 
the probability $\Pr(\omega \text{ covers } S)$. 

The empirical distribution over data set $D$ is given as 

\[ q_D(a_1 = v_1, \dots, a_n = v_n) = \frac{|\{t \in D \mid t = v\}|}{m} \tag{2} \]

We will denote by empirical frequency the frequency according to the empirical 
distribution. 

Let a family of random variable sets $\mathcal{F}$ be satisfied by a distribution $p : 
\{0,1\}^n \to [0,1]$ iff for every $S \in \mathcal{F}$ it holds that $p(\omega \text{ covers } S) = \theta_S$. 

Finally, we say that a set of binary variables $X$ is downward closed if for 
all strict subsets $S \subset X$ we know $\theta_S$, e.g., if we consider the triple of items 
$s = \{I_1, I_2, I_3\}$ then $s$ is downward closed if we know the empirical singleton 
frequencies $\theta_{I_1}, \theta_{I_2}, \theta_{I_3}$ and empirical pair frequencies $\theta_{\{I_1,I_2\}}, \theta_{\{I_2,I_3\}}, \theta_{\{I_1,I_3\}}$. 
We consider specifically the case where all itemsets of size 3 (triples) are 
downward closed and the triples are the itemsets of which we wish to estimate the 
frequency. 

2.2 Estimation by extrapolation and independence assumption 

We briefly describe the two estimators used for comparison. 

Let $X = \{I_1, I_2, I_3\} \subseteq I$, $|X| = 3$, be the triple of interest from sampled 
data set $D \subset \mathcal{D}$. The independence model assumes the occurrences of random 
variables to not be correlated, thus we have: 

\[ \theta^i_X = \theta_{I_1}\, \theta_{I_2}\, \theta_{I_3} \]

For the extrapolation estimator, let $occ(X)$ be the number of occurrences of $X$ 
in $D$. To estimate the frequency $\theta_X$ in $\mathcal{D}$ we have: 

\[ \theta^e_X = occ(X)/|D| \]

Following from Chernoff bounds on independent variables, $\theta^e_X$ is known to be a 
good estimate when $\theta_X$ is significant in $D$. 
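
The two baseline estimators can be sketched in a few lines of Python. This is only an illustration under the assumption that a sample is represented as a list of Python sets (one set of items per transaction); the function names and the toy data are hypothetical.

```python
def frequency(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def independence_estimate(sample, triple):
    """Independence model: product of the three singleton frequencies."""
    estimate = 1.0
    for item in triple:
        estimate *= frequency(sample, {item})
    return estimate


def extrapolation_estimate(sample, triple):
    """Extrapolation: occ(X) / |D|, the frequency observed in the sample."""
    return frequency(sample, triple)


# Toy sample in which the triple {a, b, c} never occurs together.
sample = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"}]
print(independence_estimate(sample, ("a", "b", "c")))
print(extrapolation_estimate(sample, ("a", "b", "c")))
```

The toy sample shows the situation studied in this paper: the extrapolation estimate is 0 because the triple has no support in the sample, while the independence estimate ignores the pair frequencies entirely.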

2.3 Maximum entropy of itemsets 

We are interested in estimating the frequency of a specific set of items $G \subseteq I$. We 
will estimate such a frequency using the maximum entropy distribution, which 
intuitively can be thought of as the most uniform distribution given some 
observed frequencies. We will compute the needed entries of the maximum entropy 
distribution to be used to compare to the empirical distribution. 



For a family of itemsets $\mathcal{F}$ and a variable set $G$ let the projected family $\mathcal{F}_G$ 
be defined as 

\[ \mathcal{F}_G = \{X \in \mathcal{F} \mid X \subseteq G\}. \]

Then letting $\mathcal{P}$ be the set of all possible probability distributions that satisfy 
$\mathcal{F}_G$, the entropy of a distribution $p$ is given as 

\[ H(p) = - \sum_{\omega \in \{0,1\}^{|G|}} p(\omega) \log p(\omega) \tag{3} \]

The maximum entropy distribution $p^*$ can be found by maximizing $H(p)$ over 
all $p \in \mathcal{P}$: 

\[ p^* = \arg\max_{p \in \mathcal{P}} H(p) \tag{4} \]

We note that $|\mathcal{F}_G| = O(2^{|G|})$ and if $\mathcal{F}_G = \emptyset$ then $p^*$ is the uniform distribution. 
The set $\mathcal{P}$ contains the empirical distribution (Equation (2)) and hence is 
nonempty by construction. We shall denote by maximum entropy estimate $\theta^m_X$ of 
an itemset $X$ the frequency of the itemset according to the maximum entropy 
distribution. 


Classifying and computing the maxent estimate. For completeness we will 
state the approach used to compute the maximum entropy estimate of itemset 
frequency. We note that a similar proof of how to find the maxent estimate 
appears in [5], however we give explicit constants in Equation (14) whereas their 
proof shows existence of the constants. 

High-level description. The overview of the proof is that for any $z > 1$ 
variables, when $\{\theta_X \mid X \subset G\}$ is given, the space of joint distributions 
of the $z$ random boolean variables consistent with these frequencies is 1-dimensional. It 
follows from this that the frequency estimate according to the maximum entropy 
distribution $p^*$ is located on a line segment. The estimate can thus be computed by 
doing a simple binary search on this line segment, and the main point of doing 
this is that we avoid computing the entire maximum entropy distribution. 

Given $z$ random boolean variables and the marginal probabilities $\{\theta_X \mid X \subset 
G\}$, the joint probability space is given by $2^z$ linear equations, where the 
rank of the corresponding matrix is $2^z - 1$. Let the $z$ boolean random 
variables $A_1, \dots, A_z$ have probabilities $\Pr[A_1], \dots, \Pr[A_z]$. For $x \subseteq [z]$, $y = [z] \setminus x$, 
denote by $f_x$ the probability $\Pr[\bigwedge_{i \in x} A_i = 1 \wedge \bigwedge_{j \in y} A_j = 0]$, and by $\theta_x$ the 
probability $\Pr[\bigwedge_{i \in x} A_i = 1]$. 


Lemma 1. For $z$ random boolean variables $A_1, \dots, A_z$ with feasible marginal 
probabilities $\{\theta_X \mid X \subset [z]\}$, the space of feasible distributions is determined 
entirely by $f_{[z]}$. More generally, for $S \subseteq [z]$, $f_S$ follows from Equation (5). 

\[ f_S = \sum_{S' \supseteq S} (-1)^{|S' \setminus S|}\, \theta_{S'} \tag{5} \]


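Proof. We prove this by induction on $|S|$, starting from $|S| = z$ and proceeding to 
smaller sets. For $|S| = z$ the only superset of $S$ is $S$ itself, so we have 

\[ f_S = \sum_{S' \supseteq S} (-1)^{|S' \setminus S|}\, \theta_{S'} = \theta_S. \]

For the inductive step we assume Equation (5) to hold for $|S| > i$. Then for 
$|S| = i$, Equation (6) holds as the sum is over strict supersets of $S$. By applying 
Equation (5) to the second term of the right hand side of Equation (6) we get 
Equation (7). 

\[ f_S = \theta_S - \sum_{S' \supsetneq S} f_{S'} \tag{6} \]

\[ f_S = \theta_S - \sum_{S' \supsetneq S}\ \sum_{S'' \supseteq S'} (-1)^{|S'' \setminus S'|}\, \theta_{S''} \tag{7} \]

The double sum of Equation (7) can be split into two as shown in Equation (8). 

\[ f_S = \theta_S - \Big[ \sum_{S' \supseteq S}\ \sum_{S'' \supseteq S'} (-1)^{|S'' \setminus S'|}\, \theta_{S''} - \sum_{S' = S}\ \sum_{S'' \supseteq S'} (-1)^{|S'' \setminus S'|}\, \theta_{S''} \Big] \tag{8} \]

The first double sum can be split into two parts 

\[ \sum_{S' \supseteq S}\ \sum_{\substack{S'' \supseteq S' \\ S'' \neq S}} (-1)^{|S'' \setminus S'|}\, \theta_{S''} + \sum_{S' = S}\ \sum_{S'' = S'} (-1)^{|S'' \setminus S'|}\, \theta_{S''}, \]

the first for which we will use Fact 1 below. 

Fact 1. For a set space $X$ and function $g : X \to \mathbb{R}$, for a double sum of 
sign-alternating supersets of any $x \in X$ we have 

\[ \sum_{x' \supseteq x}\ \sum_{\substack{x'' \supseteq x' \\ x'' \neq x}} (-1)^{|x'' \setminus x'|}\, g(x'') = 0 \]

due to summands canceling out: every fixed $x'' \neq x$ is counted once for each $x'$ 
with $x \subseteq x' \subseteq x''$, with alternating signs. 

Then by application of Fact 1 to Equation (8) we get Equation (9), where the 
first double sum consists only of the element $\theta_S$, hence we now arrive at the 
induction claim in Equation (10). 

\[ f_S = \theta_S - \Big[ \sum_{S' = S}\ \sum_{S'' = S'} (-1)^{|S'' \setminus S'|}\, \theta_{S''} - \sum_{S' = S}\ \sum_{S'' \supseteq S'} (-1)^{|S'' \setminus S'|}\, \theta_{S''} \Big] \tag{9} \]

\[ f_S = \sum_{S' = S}\ \sum_{S'' \supseteq S'} (-1)^{|S'' \setminus S'|}\, \theta_{S''} \]

\[ f_S = \sum_{S'' \supseteq S} (-1)^{|S'' \setminus S|}\, \theta_{S''} \tag{10} \]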




Intuitively, Lemma 1 states that the space of probability distributions that 
satisfy the marginal probabilities can be traversed by varying $f_{[z]}$. We note that 
Lemma 1 was used earlier by Calders & Goethals [2, eq. (1)] and that we include 
the proof for the sake of completeness. 
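
The inclusion-exclusion identity of Lemma 1 can be checked numerically. The sketch below is an illustration only: it assumes a data set given as a list of 0/1 tuples, and the helper names (`subsets`, `theta_from_rows`, `f_from_theta`) are hypothetical. It computes all cover frequencies $\theta_S$ and then recovers the exact-pattern probabilities $f_S$ via Equation (5).

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = tuple(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def theta_from_rows(rows, z):
    """theta[S] = fraction of rows that have a 1 in every position of S."""
    m = len(rows)
    return {S: sum(all(row[i] for i in S) for row in rows) / m
            for S in subsets(range(z))}


def f_from_theta(theta, z):
    """Equation (5): f_S = sum over supersets S' of S of (-1)^{|S' \\ S|} theta_{S'}."""
    full = frozenset(range(z))
    f = {}
    for S in subsets(range(z)):
        total = 0.0
        for extra in subsets(full - S):  # S' = S united with `extra`
            total += (-1) ** len(extra) * theta[S | extra]
        f[S] = total
    return f


rows = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)]
f = f_from_theta(theta_from_rows(rows, 3), 3)
print(f[frozenset({0, 1})])  # probability of the exact pattern (1, 1, 0)
```

Each $f_S$ here is the fraction of rows whose set of 1-positions is exactly $S$, so the values sum to 1.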

We shall now argue that on this line through $2^z$-dimensional space, there is 
a unique point, i.e. a unique feasible distribution $p$, that maximizes the entropy 
$H(p)$, and thus computing this point allows us to query the maximum entropy 
distribution $p^*$ for a $z$-sized set of variables. Given a feasible distribution $x$ 
consisting of entries $x_S \ge 0$ for every $S \subseteq [z]$, then by Lemma 1 the feasible 
distribution space $\mathcal{P}_f$ can be traversed by Equation (11). 

\[ \mathcal{P}_f = x + t \cdot v, \quad t \in (l, r) \tag{11} \]

where $v$ is a $2^z$-sized $\{-1, +1\}$-vector with entries $v_S = (-1)^{z - |S|}$ for every 
$S \subseteq [z]$ and the range $(l, r)$ is the range for which all coordinates in the vector 
$x + t \cdot v$ are non-negative. Consider the partitioning of subsets $S \subseteq [z]$ into 
$S_{even} = \{S \mid (z - |S|) \text{ is even}\}$ and $S_{odd} = \{S \mid (z - |S|) \text{ is odd}\}$. Then the 
$t$-value borders are given below. 

\[ l = \max\{-x_S \mid S \in S_{even}\} \]
\[ r = \min\{x_S \mid S \in S_{odd}\} \]

We note that a feasible solution $x$ exists by construction since we have observed 
a feasible distribution and that vector $v$ corresponds to the null space of the 
$2^z$-row matrix denoting the linear equalities which the joint distribution adheres 
to. 

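The location of the unique point corresponding to querying the max entropy 
distribution $p^*$ is shown in Lemma 2 below. 

Lemma 2. For a feasible distribution space $\mathcal{P}_f = x + t \cdot v$, $t \in (l, r)$ there exists 
a point $p_{max} \in \mathcal{P}_f$ s.t. $\forall p \in \mathcal{P}_f$, $p \neq p_{max} \Rightarrow H(p_{max}) > H(p)$. In particular, 
Equation (13) below holds. 

\[ \frac{d}{dt} H(x + t \cdot v) = -\frac{d}{dt} \sum_{S \subseteq [z]} \big(x_S + (-1)^{z - |S|} t\big) \log\big(x_S + (-1)^{z - |S|} t\big) \tag{12} \]

\[ \phantom{\frac{d}{dt} H(x + t \cdot v)} = -\sum_{S \subseteq [z]} (-1)^{z - |S|} \log\big(x_S + (-1)^{z - |S|} t\big) \tag{13} \]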


Proof. We wish to show the existence of the scalar $t_{max}$ that optimizes the 
entropy, i.e. $p_{max} = x + t_{max} \cdot v$. Equation (13) follows from standard derivative 
rules, from which we arrive at 

\[ \frac{d}{dt} H(x + t \cdot v) = -\sum_{S \subseteq [z]} (-1)^{z - |S|} \left( \log\big(x_S + (-1)^{z - |S|} t\big) + \frac{x_S + (-1)^{z - |S|} t}{x_S + (-1)^{z - |S|} t} \right) \]

where the rightmost term will cancel out due to there being an equal number of 
odd- and even-sized subsets of $[z]$ for any positive integer $z$. 









Corollary 1. The value $t_f \in (l, r)$ that maximizes $H(x + t \cdot v)$ while $x + t \cdot v$ is 
a feasible solution can be found as $t_f = \mathrm{median}\{l, r, t_{max}\}$. 

Proof. From Lemma 2 we have that $t_{max}$ is the $t$-value that maximizes the 
entropy function $H(x + t \cdot v)$, which is at its unique maximum when Equation (14) 
holds. 

\[ \sum_{S \in S_{odd}} \log \frac{1}{x_S - t} + \sum_{S \in S_{even}} \log (x_S + t) = 0 \tag{14} \]

Let $t_f = \mathrm{median}\{l, r, t_{max}\}$. For the line segment spanned by end points $l$ and $r$, 
we have 3 cases: if $t_{max} < l$ then $t_f = l$, if $l \le t_{max} \le r$ then $t_f = t_{max}$, and if $t_{max} > r$ 
then $t_f = r$. The median of the values $l, r, t_{max}$ distinguishes these cases. 


The $t$ for which Equation (14) holds is $t_{max}$, that is, the equation holds 
strictly under maximum entropy. Since the solution space of $(l, r)$ is always 
1-dimensional if all frequencies of the $2^z - 1$ subsets are given, we compute 
the value $t_{max}$ by binary search in this space. We note that this search is an 
approximation, but the binary search converges to values arbitrarily close to 
$t_{max}$ quickly in practice, e.g., 30 iterations of binary search were used to produce 
the results in this paper. For an itemset $X$, the $t$-value we hold after 30 such 
iterations in the search space is then the maximum entropy estimate $\theta^m_X$. 

We note that by using a constant number of iterations for each binary search 
the computation takes time linear in the number of itemsets for which we wish 
to compute the maximum entropy estimate. Up to a constant factor this is 
equivalent to the extrapolation and independence assumption estimators. 
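
The binary search described above can be sketched for the case of triples ($z = 3$). The following Python code is a minimal sketch, not the implementation used for the experiments: it assumes the line is parameterized directly by the triple frequency $t$, so that the eight atom probabilities are linear in $t$, and it searches for the $t$ at which the sums of logarithms over the two parity classes agree. All function and parameter names are hypothetical.

```python
import math


def even_atoms(t, th_x, th_y, th_w, th_xy, th_xw, th_yw):
    """Atoms with z - |S| even (three or one positive literal); they grow with t."""
    return [t,
            th_x - th_xy - th_xw + t,
            th_y - th_xy - th_yw + t,
            th_w - th_xw - th_yw + t]


def odd_atoms(t, th_x, th_y, th_w, th_xy, th_xw, th_yw):
    """Atoms with z - |S| odd (two or zero positive literals); they shrink with t."""
    return [th_xy - t, th_xw - t, th_yw - t,
            1 - th_x - th_y - th_w + th_xy + th_xw + th_yw - t]


def maxent_triple(th_x, th_y, th_w, th_xy, th_xw, th_yw, iterations=30):
    """Binary search for the maximum entropy triple frequency."""
    args = (th_x, th_y, th_w, th_xy, th_xw, th_yw)
    # Feasible range: every atom must stay non-negative.
    lo = -min(even_atoms(0.0, *args))
    hi = min(odd_atoms(0.0, *args))
    if lo > hi:
        raise ValueError("singleton and pair frequencies are not feasible")
    for _ in range(iterations):
        mid = (lo + hi) / 2
        ev, od = even_atoms(mid, *args), odd_atoms(mid, *args)
        if min(ev) <= 0:      # numerically on the lower border
            lo = mid
            continue
        if min(od) <= 0:      # numerically on the upper border
            hi = mid
            continue
        # g is increasing in t and is zero exactly at the entropy maximum.
        g = sum(map(math.log, ev)) - sum(map(math.log, od))
        if g < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def approx_triple(th_x, th_y, th_w, th_xy, th_xw, th_yw):
    """Explicit approximation: product of odd atoms over product of the
    three non-triple even atoms, all evaluated at t = 0."""
    args = (th_x, th_y, th_w, th_xy, th_xw, th_yw)
    return math.prod(odd_atoms(0.0, *args)) / math.prod(even_atoms(0.0, *args)[1:])


print(maxent_triple(0.01, 0.02, 0.03, 0.0004, 0.0006, 0.0009))
print(approx_triple(0.01, 0.02, 0.03, 0.0004, 0.0006, 0.0009))
```

For pairwise independent inputs (all singleton frequencies 0.5 and all pair frequencies 0.25) the search returns the independent value 0.125, which is a convenient sanity check. The second function assumes Python 3.8 or later for `math.prod`.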


Maxent on noisy inputs. Recall our view that the data at hand is a smaller 
sample $D$, sampled independently from a larger data set $\mathcal{D}$. 
As we wish to use the data from $D$ to reason about triples from $\mathcal{D}$ we will show 
that the error introduced by sampling does not hurt the maxent estimate too 
much; in particular we will show that a maxent estimate based on $D$ will not 
be far from a maxent estimate based on $\mathcal{D}$. We will show that the maximum 
entropy estimate on a triple computed on input with relative error $0 < \varepsilon < 0.5$ 
is only a factor $O(\varepsilon)$ from the maximum entropy estimate computed using the 
true distribution from $\mathcal{D}$ as input. 

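For a distribution $d$ over 3 variables $X$, $Y$, $W$ we have 

\[ d = \big(\Pr(XYW),\ \Pr(XY\overline{W}),\ \dots,\ \Pr(\overline{X}\,\overline{Y}\,\overline{W})\big), \]

i.e., $|d| = 2^3$. Let $|d_i|$ be the number of non-negated literals in $d_i$. The entries 
of $d$ can (as in Equation (14)) be partitioned into two sets, $u = u_1, \dots, u_4$ and 
$l = l_1, \dots, l_4$ where $d_i \in u$ if $3 - |d_i|$ is odd and $d_i \in l$ otherwise. 

Let two polynomials $L_{\varepsilon_1}(t)$ and $U_{\varepsilon_1}(t)$ with error $0 < |\varepsilon_1| < 1$ be defined 

\[ L_{\varepsilon_1}(t) = (l_1 + \varepsilon_1 + t)(l_2 + \varepsilon_1 + t)(l_3 + \varepsilon_1 + t)\, t \tag{15} \]

\[ U_{\varepsilon_1}(t) = (u_1 + \varepsilon_1 - t)(u_2 + \varepsilon_1 - t)(u_3 + \varepsilon_1 - t)(u_4 + \varepsilon_1 - t) \tag{16} \]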







where $t = \Pr(XYW)$ is the triple frequency. When we have 

\[ L_0(t) = U_0(t) \tag{17} \]

then $t$ is the triple frequency under maximum entropy. We seek to bound the 
error on the output caused by the input error $\varepsilon_1$. Letting $t_0$ be the solution to 
$L_0(t) = U_0(t)$ and $t_{\varepsilon_1}$ be the solution to $L_{\varepsilon_1}(t) = U_{\varepsilon_1}(t)$ we wish to bound the 
error on $t_{\varepsilon_1}$ in terms of $t_0$. We will show the following lemma. 

Lemma 3. Let two polynomials with additive error $t$ on the terms and where 
$0 < l_i, u_i < 1$ be defined as below. 

\[ L(t) = (l_1 + t)(l_2 + t)(l_3 + t) \tag{18} \]

\[ U(t) = (u_1 - t)(u_2 - t)(u_3 - t)(u_4 - t) \tag{19} \]

For all $t \in [0; \varepsilon \min\{u_1, \dots, u_4, l_1, \dots, l_3\}]$, where $0 < \varepsilon < 1$, the following bounds 
on Equations (15) and (16) hold 

\[ L(0)\, t \le L_0(t) \le (1 + \varepsilon)^3 L(0)\, t \tag{20} \]

\[ (1 - \varepsilon)^4 U_0(0) \le U(t) \le U_0(0) \tag{21} \]

Proof. The first inequality of Eq. (20) follows trivially from $t \ge 0$ and from 
$L$ being monotonically increasing in $t$. For the second inequality let $l_{min} = 
\min\{l_1, \dots, l_3\}$ and $t \le \varepsilon l_{min}$, $0 < \varepsilon < 1$. It follows that 

\[ L_0(t) \le (l_1 + \varepsilon l_{min})(l_2 + \varepsilon l_{min})(l_3 + \varepsilon l_{min})\, t \le (1 + \varepsilon)^3 l_1 l_2 l_3\, t = (1 + \varepsilon)^3 L(0)\, t. \]

The first inequality of Eq. (21) follows analogously; let $u_{min} = \min\{u_1, \dots, u_4\}$ 
and $t \le \varepsilon u_{min}$, $0 < \varepsilon < 1$. We then have 

\[ U(t) \ge (u_1 - \varepsilon u_{min})(u_2 - \varepsilon u_{min})(u_3 - \varepsilon u_{min})(u_4 - \varepsilon u_{min}) \ge (1 - \varepsilon)^4 U_0(0). \]

The second inequality again follows from $t \ge 0$ and $U$ being monotonically 
decreasing in $t$. 

It follows that a simple approximate formula for maximum entropy estimates on 
triples exists. 

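Lemma 4. For a triple with distribution over entries $\mathcal{D} = l_1, \dots, l_3, u_1, \dots, u_4$ 
let 

\[ \hat{t} = \frac{u_1 u_2 u_3 u_4}{l_1 l_2 l_3}. \]

The triple frequency under maximum entropy $t \le \varepsilon \min \mathcal{D}$ for $0 < \varepsilon < 0.5$ can be 
bounded in terms of $\hat{t}$: 

\[ t \in [(1 - 6.5\varepsilon)\, \hat{t},\ \hat{t}\,] \tag{22} \]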







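Proof. Recall that $t$ takes its value under maximum entropy when $U_0(t)/L_0(t) = 1$, 
i.e., since $L_0(t) = L(t)\, t$, when $t = U_0(t)/L(t)$. 
By Lemma 3 we have a lower bound on $U_0(t)/L(t)$: 

\[ \frac{(1 - \varepsilon)^4\, U_0(0)}{(1 + \varepsilon)^3\, L(0)} \le \frac{U_0(t)}{L(t)} \]

\[ \frac{(1 - \varepsilon)^4\, u_1 u_2 u_3 u_4}{(1 + \varepsilon)^3\, l_1 l_2 l_3} \le \frac{U_0(t)}{L(t)} \]

\[ \frac{(1 - \varepsilon)^4}{(1 + \varepsilon)^3}\, \hat{t} \le \frac{U_0(t)}{L(t)} \]

By expansion and using $0 < \varepsilon < 0.5$ we get the needed lower bound of Eq. (22). 

\[ 1 - c \cdot \varepsilon \le \frac{(1 - \varepsilon)^4}{(1 + \varepsilon)^3} \]

\[ c \ge -\varepsilon^3 + 5\varepsilon^2 - 3\varepsilon + 7 \]

\[ c \ge 6.5 \]

Equivalently we have the needed upper bound 

\[ U_0(t)/L(t) \le U_0(0)/L(0) \]
\[ U_0(t)/L(t) \le u_1 u_2 u_3 u_4 / (l_1 l_2 l_3) \]
\[ U_0(t)/L(t) \le \hat{t} \]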


We will now assume relative error on the distribution entries $l_i, u_i$. The following 
lemma holds analogously to Lemma 4. 

Lemma 5. Let distribution $d = (l_1, \dots, l_3, u_1, \dots, u_4)$. Let an approximation of 
distribution $d$ be defined by $u'_i \in [(1 - \varepsilon) u_i, (1 + \varepsilon) u_i]$ and $l'_i \in [(1 - \varepsilon) l_i, (1 + \varepsilon) l_i]$ 
for each $u_i, l_i \in d$. Let 

\[ \hat{t}' = \frac{u'_1 u'_2 u'_3 u'_4}{l'_1 l'_2 l'_3}, \qquad \hat{t} = \frac{u_1 u_2 u_3 u_4}{l_1 l_2 l_3}. \]

It holds that 

\[ t' \in [(1 - 6.5\varepsilon')\, \hat{t},\ \hat{t}\,] \tag{23} \]

Proof. (Theorem 1) The theorem follows directly from Lemmas 4 and 5. 
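
To make the effect of relative input error tangible, the following Python sketch perturbs each entry of $d$ by an independent relative error of at most $\varepsilon$ and compares the explicit formula on perturbed and unperturbed inputs. It only checks the elementary fact that a quotient of four factors over three factors, each off by a factor in $[1 - \varepsilon, 1 + \varepsilon]$, changes by a factor between $(1 - \varepsilon)^4/(1 + \varepsilon)^3$ and $(1 + \varepsilon)^4/(1 - \varepsilon)^3$; it is not a verification of the constants in the lemmas. The numeric entries and all names are hypothetical.

```python
import random


def t_hat(u, l):
    """Explicit approximation u1*u2*u3*u4 / (l1*l2*l3)."""
    numerator = 1.0
    for value in u:
        numerator *= value
    denominator = 1.0
    for value in l:
        denominator *= value
    return numerator / denominator


def perturb(values, eps, rng):
    """Multiply every entry by an independent factor in [1 - eps, 1 + eps]."""
    return [v * (1.0 + rng.uniform(-eps, eps)) for v in values]


def ratio_bounds(eps):
    """Smallest and largest possible value of t_hat(perturbed) / t_hat(exact)."""
    lower = (1 - eps) ** 4 / (1 + eps) ** 3
    upper = (1 + eps) ** 4 / (1 - eps) ** 3
    return lower, upper


rng = random.Random(0)          # fixed seed so the run is repeatable
u = [0.0004, 0.0006, 0.0009, 0.9419]
l = [0.009, 0.0187, 0.0285]
eps = 0.05
exact = t_hat(u, l)
noisy = t_hat(perturb(u, eps, rng), perturb(l, eps, rng))
print(noisy / exact, ratio_bounds(eps))
```

For small $\varepsilon$ both bounds are $1 \pm 7\varepsilon$ to first order, which matches the $O(\varepsilon)$ dependence stated in Theorem 1.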


3 Experimental Results 

Our experiments were conducted on 5 real datasets shown in Table 1. For each
dataset we prune the singletons with support below a specified threshold. We
perform this pruning in order to construct datasets where intuitively the
independence model has a chance to do well, as it relies solely on the
singletons, but also to keep the number of items down to a practical level, as
the time complexity with n items is $O(n^3)$ per experiment. AOL Queries¹ is a
uniform sample of the infamous AOL search terms dataset. Docwords consists of
transactions of words occurring together in documents. From MovieLens² we
create a dataset where a transaction is the set of movies which a particular
user rated 5/5 stars. Retail³ consists of shopping baskets from an anonymous
Belgian supermarket.

1 http://www.gregsadetsky.com/aol-data/
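The pruning step can be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the code used for the experiments:
the function names and the toy transactions are hypothetical, support is taken
to be the number of transactions containing an item, and items are kept when
their support is at least the threshold.

```python
from collections import Counter
from itertools import combinations

def prune_singletons(transactions, threshold):
    """Drop every item whose support (number of transactions containing
    it) is below `threshold`; return pruned transactions and kept items."""
    support = Counter(item for t in transactions for item in set(t))
    keep = {item for item, count in support.items() if count >= threshold}
    return [set(t) & keep for t in transactions], keep

def count_triple_occurrences(transactions):
    """occ(X) for every triple X contained in at least one transaction."""
    occ = Counter()
    for t in transactions:
        for triple in combinations(sorted(t), 3):
            occ[triple] += 1
    return occ

# Toy data, for illustration only.
transactions = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"a", "b"}, {"c", "e"}]
pruned, items = prune_singletons(transactions, threshold=2)
occ = count_triple_occurrences(pruned)
```

With n items remaining after pruning there are n(n-1)(n-2)/6 possible triples,
which is where the cubic dependence on n comes from; for n = 211 this is
1,543,465 triples.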

Overview of experiments. We start by showing how the independence 
and maxent estimators perform on the full datasets, i.e., when there is significant 
support for the triple whose frequency is to be estimated as well as its subsets. 
The maxent estimator shows better concentration and less variance even for the 
low-support triples. We then perform experiments for all datasets where 1% of 
the transactions are sampled and we wish to estimate the frequency of triples in 
the whole data set. For every triple X with 30 < occ(X) < 100 we use the three
estimators $\theta^m_X$, $\theta^i_X$ and $\theta^e_X$. As X occurs at most 100 times in the full dataset, it
occurs in expectation at most once in the sample. 

We also study precision and recall for the problem of approximating the set 
of frequent triples based on the estimators. Again we consider two cases: 1) 
Estimates are based on the full dataset, where the set to approximate consists 
of the 10% highest frequencies among all triples, and 2) as in the first case, but 
with estimates based on a 1% sample. In the latter case, if the threshold for being
in the top 10% is $\Delta$, then we include in the estimate all triples that are estimated
to have at least $0.9\Delta$ occurrences. The number 0.9 was experimentally found to
yield a good precision/recall tradeoff for the maximum entropy estimates.

Finally, using again a 1% sample of the data, we compute the average ratio 
between the error made by our maximum entropy estimate and estimates made 
by independence and extrapolation, respectively. 

In summary our experiments show: 

1. In most cases, the maximum entropy estimator provides the best estimate
for low-support triples.

2. Sampling with higher probability increases concentration greatly for $\theta^m_X$, even
when the sample is still of insufficient size for $\theta^e_X$ to be useful.

3. In almost all cases studied, the precision and recall are strictly better for $\theta^m_X$
(see Tables 2 and 3).

4. When predicting the occurrences of triples in the full dataset using a 1%
sample, we show that $\theta^m_X$ improves the absolute error by a factor 3-14 compared
to $\theta^i_X$ and $\theta^e_X$ (see Table 4).


3.1 Maxent vs. independence for full datasets 

For all datasets we ran the estimators on every triple with occ > 30 on the full
(unsampled) dataset. The figures show the concentration of the estimates by
plotting, for each triple, its observed value (Y-axis) against its estimated value
(X-axis). The plots are shown in Figures 3 and 4.

2 http://www.grouplens.org/node/73 

3 http://fimi.ua.ac.be/data/ 



Name           Occ. Threshold   #Items   #Transactions
AOL Queries    500              211      144038
Docwords       3000             142      49078
Movielens      3000             85       67312
Retail         500              85       88162

Table 1: Datasets used. All datasets are from real data and have previously been
used for data mining purposes.


Dataset        Independence           Maxent
               Precision   Recall     Precision   Recall
AOL Queries    0.0         0.0        0.54        1.0
Docwords       0.93        0.43       0.92        0.93
MovieLens      1.00        0.003      0.80        0.96
Retail         1.00        0.43       0.99        0.97

Table 2: Precision-recall for full datasets. Relevant triple threshold $\Delta$ is such
that relevant triples are among the 10% most frequent.


For the Docwords dataset both the maxent estimate (Figure 3a) and the
independence estimate (Figure 3b) are concentrated around X = Y, which
denotes the line of optimal estimates. For MovieLens we observe similar
concentration for maxent (Figure 3c), while the independence estimator
underestimates slightly (Figure 3d); Retail behaves equivalently in this
setting. For AOL Queries (Figure 4) the independence estimates likewise
underestimate, here greatly, due to high positive correlation, while the maxent
estimator overestimates slightly but is far more concentrated.

Our precision and recall computations (Table 2) are based on the relevance
threshold $\Delta$ being such that relevant triples are among the top 10% most
frequent, and we report a triple if its estimate is at least $\Delta$. Note that the
high precision for the independence estimate is due to heavy underestimation:
the estimator reports too few triples as relevant, as the recall shows. An
extreme case of this is the AOL dataset, for which zero triples are reported.
Maxent has similar precision but much higher recall for all datasets.


Dataset        Ind.             Maxent           Extrapol.
               Prec    Rec      Prec    Rec      Prec    Rec
AOL Queries    1.0     0.001    0.23    0.89     0.31    0.67
Docwords       0.88    0.44     0.56    0.72     0.53    0.58
MovieLens      1.0     0.009    0.51    0.90     0.49    0.85
Retail         0.93    0.40     0.51    0.78     0.49    0.73

Table 3: Precision-recall for sampled datasets. Relevant triple threshold $\Delta$ is
such that relevant triples are among the 10% most frequent. Triples are reported
when occ > $0.9\Delta$, as this was observed to maximize precision/recall for $\theta^e_X$.












Dataset        Independence   Extrapolation
AOL Queries    7.91           7.58
Docwords       6.31           14.32
MovieLens      11.55          7.22
Retail         3.22           4.42

Table 4: For n estimates and two estimators $est_1$ and $est_2$ the table shows the
normalized absolute error ratio
$\frac{1}{n} \sum_X \left( |est_1(X) - occ(X)| \, / \, |est_2(X) - occ(X)| \right)$,
where $est_2$ is maximum entropy and $est_1$ is independence and extrapolation
respectively for the two columns.



(a) Max. entropy estimates for all triples of occ > 30 in Docwords.
(b) Independence estimates for all triples of occ > 30 in Docwords.
(c) Max. entropy estimates for all triples of occ > 30 in Movielens.
(d) Independence estimates for all triples of occ > 30 in Movielens.

Fig. 3: Concentration plots for Docwords and Movielens datasets. Each point is
a triple, the Y-value of a point is the empirical number of occurrences (occ)
of the triple while the X-value is the estimated number of occurrences, using
either independence or maxent estimators. The red line is X=Y. We observe
better concentration on the maxent estimates, in particular when the statistical
significance is high.











(a) Maxent estimates for all triples of occ > 30 in AOL Queries.
(b) Independence estimates for all triples of occ > 30 in AOL Queries.

Fig. 4: Concentration plots for AOL Queries. Each point is a triple, the Y-value
of a point is the empirical number of occurrences (occ) of the triple while the
X-value is the estimated number of occurrences, using either independence or
maxent estimators. The red line is X=Y. We observe better concentration for
Maxent for all occurrences.


3.2 Low-support itemsets 

On the same datasets we now examine triples X where 30 < occ(X) < 100 and
we sample every transaction independently at random with probability 1/100.
We wish to estimate $\theta_X$ in the full dataset, but since occ(X) < 100,
independent sampling implies that the expected number of occurrences
of X in our sample is < 1. We perform estimates using the extrapolation
estimate $\theta^e_X$, the independence estimate $\theta^i_X$ and the maxent estimate $\theta^m_X$. We
restrict ourselves to triples where all pairs occur in the sample.
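The sampling setup and the two simpler estimators can be sketched in Python as
follows. The sketch rests on assumptions that should be checked against the
definitions earlier in the paper: the extrapolation estimate is taken to be
the sample count scaled by 1/p, and the independence estimate is taken to be
the product of single-item sample frequencies scaled to the expected size of
the full dataset. All function names are hypothetical, and the maxent
estimate is omitted because it requires solving for the maximum entropy
distribution.

```python
import random

def sample_transactions(transactions, p, seed=0):
    """Keep each transaction independently with probability p."""
    rng = random.Random(seed)
    return [t for t in transactions if rng.random() < p]

def extrapolation_estimate(sample, triple, p):
    """Assumed form: occurrences of the triple in the sample, scaled by 1/p."""
    hits = sum(1 for t in sample if triple <= t)
    return hits / p

def independence_estimate(sample, triple, p):
    """Assumed form: product of the single-item sample frequencies, scaled
    to the expected number of transactions in the full dataset."""
    if not sample:
        return 0.0
    n = len(sample)
    freq = 1.0
    for item in triple:
        freq *= sum(1 for t in sample if item in t) / n
    return freq * n / p

# Tiny hand-made sample, for illustration only.
sample = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
triple = frozenset("abc")
est_e = extrapolation_estimate(sample, triple, p=0.5)
est_i = independence_estimate(sample, triple, p=0.5)
```

In this toy sample the triple occurs once, so the extrapolation estimate is
1 / 0.5 = 2, while each item has sample frequency 3/4 and the independence
estimate is (3/4)^3 * 4 / 0.5 = 3.375.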

The extrapolation estimator $\theta^e_X$ performs similarly on all datasets. While $\theta^e_X$
is an unbiased estimator of $\theta_X$, the variance is large as a consequence of the
sampling: by the mode of independent Bernoulli trials, the most likely
outcome for a triple is $\lceil \mu \rceil = 1$, with the outcomes 0 and 2 slightly less probable.
On the AOL, Retail and MovieLens datasets our experiments conclude similarly:
$\theta^e_X$ does not give meaningful estimates, e.g., zero for the unsampled triples and
overestimates on the sampled triples, while $\theta^i_X$ underestimates greatly and $\theta^m_X$ is
fairly concentrated. See Figure 5 for an example of all three estimators on AOL. For
Docwords in this setting, we get better concentration for $\theta^i_X$ than for $\theta^m_X$. This can
be explained by pair frequencies being more vulnerable to noise introduced by
sampling than single frequencies, and by $\theta^i_X$ being well concentrated for Docwords
(recall Figure 3b). However, we observe that if we raise the sampling probability
from 1/100 to 1/20 then $\theta^m_X$ has better concentration. In summary, there is a
sampling rate at which $\theta^e_X$ is very poorly concentrated and $\theta^m_X$ outperforms $\theta^i_X$
on all datasets.
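The variance remark can be made concrete by assuming that the number of
sampled occurrences of a triple is binomial: occ(X) independent trials, each
kept with probability p. The Python sketch below computes these probabilities
for a triple at the upper end of the range considered, occ(X) = 100 and
p = 1/100. The function name is hypothetical.

```python
from math import comb

def sampled_count_pmf(occ, p, k):
    """Probability that exactly k of the occ occurrences land in the sample
    when each is kept independently with probability p (binomial)."""
    return comb(occ, k) * p**k * (1 - p) ** (occ - k)

# Probabilities of seeing the triple 0, 1, 2 or 3 times in the sample.
probs = [sampled_count_pmf(100, 0.01, k) for k in range(4)]
```

Under this assumption a single occurrence in the sample is the most likely
outcome, with no occurrence almost as likely, so an estimate that scales the
sample count by 1/p jumps between 0, 100 and 200 from one sample to the next.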

Precision and recall values (Table 3) were computed by setting the relevance
threshold $\Delta$ such that if occ(X) > $\Delta$ then triple X is among the 10% most
frequent triples in the entire dataset with occ(X) > 30. Triples are reported if
the occurrence estimate, which is based on the 1/100 independent sample of all
transactions, is at least $0.9\Delta$, as this value was observed experimentally to yield a
good tradeoff. AOL and MovieLens using $\theta^i_X$ have full precision due to reporting
very few triples. Note that $\theta^e_X$ has an advantage in terms of precision/recall
because it mostly overestimates, i.e., if a triple occurs in the sample then
it will likely be reported. Even so, $\theta^m_X$ is better than $\theta^e_X$ except for
AOL, where only recall is better.
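The precision/recall computation described above can be sketched in Python as
follows. The function names and the toy numbers are hypothetical. The sketch
assumes a triple counts as relevant when its true occurrence count reaches the
threshold, and it reports precision 0 when no triple is reported; both
conventions are assumptions rather than details taken from the paper.

```python
def top_fraction_threshold(true_occ, fraction=0.10):
    """Occurrence count Delta of the last triple inside the top `fraction`
    most frequent triples."""
    ranked = sorted(true_occ.values(), reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[k - 1]

def precision_recall(true_occ, est_occ, delta, slack=0.9):
    """A triple is relevant if occ >= delta and reported if its estimate
    is at least slack * delta."""
    relevant = {x for x, occ in true_occ.items() if occ >= delta}
    reported = {x for x, est in est_occ.items() if est >= slack * delta}
    hits = len(relevant & reported)
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy data: 20 triples with true counts 100, 99, ..., 81.
true_occ = {i: 100 - i for i in range(20)}
est_occ = dict.fromkeys(range(20), 0)
est_occ.update({0: 95, 1: 80, 2: 90})
delta = top_fraction_threshold(true_occ)
precision, recall = precision_recall(true_occ, est_occ, delta)
```

In the toy data the threshold is 99, triples 0 and 1 are relevant, and triples
0 and 2 are reported, so both precision and recall are 0.5.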

In the same setting, i.e., triples where occ(X) > 30 and using an independent
1% sample, we compute the normalized ratio of the absolute error between
estimators $\theta^i_X$, $\theta^e_X$ and $\theta^m_X$. That is, letting $est_m(X)$, $est_i(X)$ denote estimates
of triple X by maxent and independence respectively, the normalized ratio is
$\frac{1}{n} \sum_X \frac{|est_i(X) - occ(X)|}{|est_m(X) - occ(X)|}$. These experiments show (see Table 4) that for all datasets
in this setting we would improve, on average, our absolute error by a factor of
3-14 by switching from independence or extrapolation to maximum entropy.
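The normalized ratio can be computed as in the following Python sketch. The
function name and the toy numbers are hypothetical. Skipping triples that the
second estimator predicts exactly, to avoid division by zero, is an assumption
of this sketch and is not taken from the paper.

```python
def normalized_error_ratio(occ, est_1, est_2):
    """Average over triples X of |est_1(X) - occ(X)| / |est_2(X) - occ(X)|.

    Triples with est_2(X) == occ(X) are skipped (sketch assumption)."""
    ratios = [
        abs(est_1[x] - occ[x]) / abs(est_2[x] - occ[x])
        for x in occ
        if est_2[x] != occ[x]
    ]
    return sum(ratios) / len(ratios) if ratios else float("nan")

# Toy data, for illustration only.
occ = {"x": 50, "y": 40}
est_ind = {"x": 10, "y": 100}     # plays the role of est_1
est_maxent = {"x": 45, "y": 50}   # plays the role of est_2
ratio = normalized_error_ratio(occ, est_ind, est_maxent)
```

For the toy data the per-triple ratios are 40/5 = 8 and 60/10 = 6, giving an
average of 7. A value above 1 means the second estimator has the smaller
absolute error on average.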



Fig. 5: est/obs distribution plots for AOL Queries sampled at 1/100. We observe
that independence underestimates greatly, while maxent has better concentration
than extrapolation, in particular for the low occurrence triples.
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